
Update ingestion to write feature metrics for validation #449

Conversation

davidheryanto
Collaborator

@davidheryanto davidheryanto commented Jan 29, 2020

What this PR does / why we need it:
This PR updates Feast ingestion to write feature-level metrics so that feature values and presence can be validated. The schema for feature validation follows that of TensorFlow Metadata (https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto). The FeatureSpec and EntitySpec in FeatureSet.proto have been updated to partially support the validation schema in #438.

This PR only addresses the Milestone 1 requirements in https://docs.google.com/document/d/1TPmd7r4mniL9Y-V_glZaWNo5LMXLshEAUpYsohojZ-8

In order to fully make use of these metrics, it is assumed that:

  • A StatsD collector is running so the Feast ingestion job can push the feature-level metrics via StatsD
  • A StatsD-to-Prometheus exporter is running to convert the StatsD metrics to Prometheus format (other metrics in Feast use Prometheus, so the conversion standardizes the metrics type; StatsD is used in the ingestion job because of Prometheus's pull-based behaviour, see https://cloud.google.com/blog/products/data-analytics/managing-and-monitoring-a-cloud-dataflow-setup)
  • The Prometheus metrics are scraped regularly
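The StatsD push in the first bullet is just a UDP datagram in a simple text format. The following is a minimal sketch of that mechanism; the metric name and tags are hypothetical, not the exact names the ingestion job emits:

```python
import socket

def send_statsd_gauge(name, value, tags=None, host="localhost", port=8125):
    """Emit a StatsD gauge over UDP, using DogStatsD-style tags."""
    payload = f"{name}:{value}|g"
    if tags:
        payload += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # UDP is fire-and-forget: nothing fails if no collector is listening.
    sock.sendto(payload.encode("utf-8"), (host, port))
    sock.close()
    return payload  # returned so the wire format can be inspected

# Hypothetical per-feature metric, tagged with project and feature name
line = send_statsd_gauge(
    "feast_ingestion_feature_value_max",
    42.5,
    tags={"feature_name": "total_rides", "feast_project": "default"},
)
```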

A sample grafana-dashboard.json is also provided: a generic Grafana dashboard that users can import to make use of the Prometheus metrics. Alerts can then be configured accordingly. The sample dashboard looks like the following:

[Screenshot: grafana-feast-feature-validation-2019-01-29-1]

NOTE that FeaturePresence (https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto#L544) defined in TensorFlow Metadata is used for the batch case in the TensorFlow example: the scope for feature counts and fractions is all the features in a particular batch ingestion. In the continuous ingestion case with an unbounded source, however, the scope needs to be specified by the user. In the sample Grafana dashboard above, this scope is the time range selected by the user (the "Last 5 minutes" control at the top right).
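To illustrate that range-based scoping, a Grafana panel could compute a presence fraction over the user-selected range with a PromQL query along these lines (the metric names are hypothetical and depend on the StatsD-to-Prometheus mapping in use):

```promql
# Fraction of rows in which the feature was present, scoped to the
# dashboard's selected time range ($__range).
sum(increase(feast_ingestion_feature_presence_count[$__range]))
/
sum(increase(feast_ingestion_feature_row_count[$__range]))
```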

Which issue(s) this PR fixes:
Partially addresses #172

Does this PR introduce a user-facing change?:

NONE

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidheryanto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@davidheryanto davidheryanto changed the title Update ingestion metrics for validation Update ingestion to write Feature metrics for validation Jan 29, 2020
@davidheryanto davidheryanto changed the title Update ingestion to write Feature metrics for validation Update ingestion to write feature metrics for validation Jan 29, 2020
@davidheryanto
Collaborator Author

/hold
I will add tests that call the Prometheus API in the test-end-to-end.sh stage to ensure that the metrics are as expected.

@Yanson
Contributor

Yanson commented Jan 29, 2020

> A StatsD-to-Prometheus exporter is running to convert the StatsD metrics to Prometheus format (other metrics in Feast use Prometheus, so the conversion standardizes the metrics type; StatsD is used in the ingestion job because of Prometheus's pull-based behaviour, see https://cloud.google.com/blog/products/data-analytics/managing-and-monitoring-a-cloud-dataflow-setup)

Can this (and any other metrics in ingestion) be migrated to standard Beam custom metrics?

https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/metrics/Metrics.html

These will automatically end up in Stackdriver (when using Dataflow) and you can use a Stackdriver exporter for Prometheus:

https://github.com/helm/charts/tree/master/stable/stackdriver-exporter

@woop
Member

woop commented Jan 29, 2020

@davidheryanto In addition to the test, can we please also add documentation to describe to users how they can benefit from this? Specifically

  1. The configuration options they have available
  2. Setting up the metric collection environment
  3. The metrics that can be collected and possible examples of them (as you have above)

@woop
Member

woop commented Jan 29, 2020

> A StatsD-to-Prometheus exporter is running to convert the StatsD metrics to Prometheus format (other metrics in Feast use Prometheus, so the conversion standardizes the metrics type; StatsD is used in the ingestion job because of Prometheus's pull-based behaviour, see https://cloud.google.com/blog/products/data-analytics/managing-and-monitoring-a-cloud-dataflow-setup)
>
> Can this (and any other metrics in ingestion) be migrated to standard Beam custom metrics?
>
> https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/metrics/Metrics.html
>
> These will automatically end up in Stackdriver (when using Dataflow) and you can use a Stackdriver exporter for Prometheus:
>
> https://github.com/helm/charts/tree/master/stable/stackdriver-exporter

Hi @Yanson, thanks for your input. I will let @davidheryanto and @zhilingc respond to you on this point in more detail, but we previously did use Beam custom metrics. I believe we moved away because those metrics are scoped to a global window, and because the available instrumentation types are limited compared to StatsD/Prometheus.

@zhilingc
Collaborator

> A StatsD-to-Prometheus exporter is running to convert the StatsD metrics to Prometheus format (other metrics in Feast use Prometheus, so the conversion standardizes the metrics type; StatsD is used in the ingestion job because of Prometheus's pull-based behaviour, see https://cloud.google.com/blog/products/data-analytics/managing-and-monitoring-a-cloud-dataflow-setup)
>
> Can this (and any other metrics in ingestion) be migrated to standard Beam custom metrics?
>
> https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/metrics/Metrics.html
>
> These will automatically end up in Stackdriver (when using Dataflow) and you can use a Stackdriver exporter for Prometheus:
>
> https://github.com/helm/charts/tree/master/stable/stackdriver-exporter

Thanks for the suggestion! I think it's a viable option. Beam metrics behave very similarly to Prometheus exporters, and as long as they are exported to Prometheus (which can compute time-windowed metrics using the delta), it should work just fine.

- Create 2 distinct services, one for metrics and one for StatsD, because they use different protocols (TCP and UDP respectively), and exposing this via a Kubernetes LoadBalancer won't work by default because most LoadBalancers support only either TCP or UDP
- Allow users to define statsd_mapping from Helm values
@davidheryanto davidheryanto force-pushed the update-ingestion-metrics-for-validation branch from 27f43fd to 2887e8c Compare February 9, 2020 14:09
- Include more filters for constraint validation (so there are no many-to-many results for subtractions)
- Use datasource: null so the dashboard will use the "default" datasource
- Unset uid so it will be autogenerated during import
@davidheryanto davidheryanto force-pushed the update-ingestion-metrics-for-validation branch from 2887e8c to 0b72ddd Compare February 9, 2020 14:35
@davidheryanto davidheryanto force-pushed the update-ingestion-metrics-for-validation branch from f179240 to 5593040 Compare February 10, 2020 01:36
@davidheryanto davidheryanto force-pushed the update-ingestion-metrics-for-validation branch 2 times, most recently from acbe497 to c4d66e7 Compare February 10, 2020 08:57
@davidheryanto davidheryanto force-pushed the update-ingestion-metrics-for-validation branch from c4d66e7 to 1a6ebc5 Compare February 10, 2020 08:59
@davidheryanto
Collaborator Author

/hold cancel

@davidheryanto
Collaborator Author

/retest

@woop
Member

woop commented Feb 11, 2020

@Yanson I have created a separate issue to track an evaluation of Beam metrics to replace Statsd: #469

@woop
Member

woop commented Feb 11, 2020

@davidheryanto waiting for the tests to pass before I evaluate this PR

"%%bash\n",
"# Sample data from BigQuery public dataset: bikeshare stations\n",
"# https://cloud.google.com/bigquery/public-data\n",
"wget https://raw.githubusercontent.com/davidheryanto/feast/update-ingestion-metrics-for-validation/examples/data_validation/bikeshare_stations.csv\n",
Member

davidheryanto?

Collaborator Author

Yes, because this PR is not merged yet. I will update it once merged.

dashboards:
  feast:
    feast-ingestion-dashboard:
      url: https://raw.githubusercontent.com/davidheryanto/feast/update-ingestion-metrics-for-validation/ingestion/src/main/resources/grafana-dashboard.json
Member

davidheryanto again

sleep 20
tail -n10 /var/log/kafka.log
kafkacat -b localhost:9092 -L

echo "
============================================================
Installing Telegraf with StatsD input and Prometheus output
Member

This needs to be changed to statsd_exporter as below.

- name: prometheus-statsd-exporter
  version: 0.1.2
  condition: prometheus-statsd-exporter.enabled
Member

This should be moved out of core.
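For context on the statsd_mapping Helm value mentioned in the commit notes, a statsd_exporter mapping config typically looks something like the sketch below; the dotted metric shape and label positions are illustrative, not Feast's actual mapping:

```yaml
mappings:
  # Map dotted StatsD names like
  #   feast.ingestion.<project>.<feature>.value_missing_count
  # to a single Prometheus metric with labels.
  - match: "feast.ingestion.*.*.value_missing_count"
    name: "feast_ingestion_feature_value_missing_count"
    labels:
      feast_project: "$1"
      feature_name: "$2"
```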


    @presence.setter
    def presence(self, presence: schema_pb2.FeaturePresence):
        if not isinstance(presence, schema_pb2.FeaturePresence):
Member

Should we add comments for the setters?
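A documented, type-checked setter might look like the following sketch, which uses a stand-in class in place of schema_pb2.FeaturePresence so it runs on its own:

```python
class FeaturePresence:
    """Stand-in for tensorflow_metadata's schema_pb2.FeaturePresence."""
    def __init__(self, min_fraction=1.0, min_count=1):
        self.min_fraction = min_fraction
        self.min_count = min_count


class Feature:
    def __init__(self):
        self._presence = None

    @property
    def presence(self):
        """Constraints on how often this feature must appear in ingested data."""
        return self._presence

    @presence.setter
    def presence(self, presence):
        """Set presence constraints, rejecting values of the wrong type."""
        if not isinstance(presence, FeaturePresence):
            raise TypeError("presence must be a FeaturePresence")
        self._presence = presence


f = Feature()
f.presence = FeaturePresence(min_fraction=0.9)
```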

        self.value_count = feature.value_count

    def update_domain_info(
        self, feature: schema_pb2.Feature, schema: schema_pb2.Schema = None
Member

More missing comments

@@ -167,3 +170,94 @@ def test_add_features_from_df_success(
)
assert len(my_feature_set.features) == feature_count
assert len(my_feature_set.entities) == entity_count

    def test_import_tfx_schema(self):
        tests_folder = pathlib.Path(__file__).parent
Member

Doesn't this depend on where pytest is launched from?

Collaborator Author

Oh, pathlib.Path(__file__) gives the path of test_feature_set.py itself, so the relative path is correct regardless of where pytest is launched from. If pytest is called from a different context, the path still resolves to the correct location.
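The point above can be demonstrated with a short sketch; the fixture layout under a data/ directory is hypothetical:

```python
import pathlib

# __file__ is the current module, so a path built from it does not
# depend on the working directory python (or pytest) was launched from.
tests_folder = pathlib.Path(__file__).parent
schema_path = tests_folder / "data" / "schema.pbtxt"  # hypothetical fixture path
```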

@@ -11,21 +11,24 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
Member

Shouldn't we also add the other missing tests like for your update methods?

Collaborator Author

Ok will add it

@davidheryanto
Collaborator Author

/hold
Updating charts

@woop
Member

woop commented Feb 12, 2020

This PR is still missing the following requirement from the Feature Validation RFC:

> A fixed number of histogram bins are generated from the bounds provided by the user
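The missing requirement (a fixed number of histogram bins generated from user-provided bounds) could be sketched as follows; the default bin count and the equal-width strategy here are illustrative, not the RFC's exact specification:

```python
def histogram_bins(low, high, num_bins=10):
    """Split the user-provided bounds [low, high] into num_bins
    equal-width (lower_edge, upper_edge) bins."""
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / num_bins
    return [(low + i * width, low + (i + 1) * width) for i in range(num_bins)]

bins = histogram_bins(0.0, 100.0, num_bins=4)
# bins -> [(0.0, 25.0), (25.0, 50.0), (50.0, 75.0), (75.0, 100.0)]
```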

@ches ches added this to the v0.5.0 milestone Feb 14, 2020
@davidheryanto
Collaborator Author

Closing this because master has been updated to include writing metrics for feature values, which overlaps with the changes here.
I will also break the changes into smaller parts, e.g. excluding the updates to the Helm chart.

@ches
Member

ches commented Mar 8, 2020

I believe the last comment refers to #486, for reference.
